

Search for: All records

Creators/Authors contains: "DeGiorgio, Michael"


  1. Background: Genetic variation provides a foundation for understanding evolution. With the rise of artificial intelligence, machine learning has emerged as a powerful tool for identifying genomic footprints of evolutionary processes through simulation-based predictive modeling. However, existing approaches require prior knowledge of the factors shaping genetic variation, whereas uncovering anomalous genomic regions regardless of their causes remains an equally important and complementary endeavor. Methods: To address this problem, we introduce ANDES (ANomaly DEtection using Summary statistics), a suite of algorithms that apply statistical techniques to extract features for unsupervised anomaly detection. A key innovation of ANDES is its ability to account for autocovariation due to linkage disequilibrium by fitting curves to contiguous windows and computing their first and second derivatives, thereby capturing the “velocity” and “acceleration” of genetic variation. These features are then used to train models that flag biologically significant or artifactual regions. Results: Application to human genomic data demonstrates that ANDES successfully detects anomalous regions that colocalize with genes under positive or balancing selection. Moreover, these analyses reveal a non-uniform distribution of anomalies, which are enriched in specific autosomes, intergenic regions, introns, and regions with low GC content, repetitive sequences, and poor mappability. Conclusions: ANDES thus offers a novel, model-agnostic framework for uncovering anomalous genomic regions in both model and non-model organisms. 
    Free, publicly-accessible full text available June 1, 2026
  2. Abstract Natural selection leaves detectable patterns of altered spatial diversity within genomes, and identifying affected regions is crucial for understanding species evolution. Recently, machine learning approaches applied to raw population genomic data have been developed to uncover these adaptive signatures. Convolutional neural networks (CNNs) are particularly effective for this task, as they handle large data arrays while maintaining element correlations. However, shallow CNNs may miss complex patterns due to their limited capacity, while deep CNNs can capture these patterns but require extensive data and computational power. Transfer learning addresses these challenges by utilizing a deep CNN pretrained on a large dataset as a feature extraction tool for downstream classification and evolutionary parameter prediction. This approach reduces extensive training data generation requirements and computational needs while maintaining high performance. In this study, we developed TrIdent, a tool that uses transfer learning to enhance detection of adaptive genomic regions from image representations of multilocus variation. We evaluated TrIdent across various genetic, demographic, and adaptive settings, in addition to unphased data and other confounding factors. TrIdent demonstrated improved detection of adaptive regions compared to recent methods using similar data representations. We further explored model interpretability through class activation maps and adapted TrIdent to infer selection parameters for identified adaptive candidates. Using whole-genome haplotype data from European and African populations, TrIdent effectively recapitulated known sweep candidates and identified novel cancer- and other disease-associated genes as potential sweeps.
  3. Abstract Just exactly which tree(s) should we assume when testing evolutionary hypotheses? This question has plagued comparative biologists for decades. Though all phylogenetic comparative methods require input trees, we seldom know with certainty whether even a perfectly estimated tree (if this is possible in practice) is appropriate for our studied traits. Yet, we also know that phylogenetic conflict is ubiquitous in modern comparative biology, and we are still learning about its dangers when testing evolutionary hypotheses. Here, we investigate the consequences of tree-trait mismatch for phylogenetic regression in the presence of gene tree–species tree conflict. Our simulation experiments reveal excessively high false positive rates for mismatched models with both small and large trees, simple and complex traits, and known and estimated phylogenies. In some cases, we find evidence of a directionality of error: assuming a species tree for traits that evolved according to a gene tree sometimes fares worse than the opposite. We also explored the impacts of tree choice using an expansive, cross-species gene expression dataset as an arguably “best-case” scenario in which one may have a better chance of matching tree with trait. Offering a potential path forward, we found promise in the application of a robust estimator as a potential, albeit imperfect, solution to some issues raised by tree mismatch. Collectively, our results emphasize the importance of careful study design for comparative methods, highlighting the need to fully appreciate the role of accurate and thoughtful phylogenetic modeling. 
  4. Abstract In recent years, advances in image processing and machine learning have fueled a paradigm shift in detecting genomic regions under natural selection. Early machine learning techniques employed population-genetic summary statistics as features, which focus on specific genomic patterns expected by adaptive and neutral processes. Though such engineered features are important when training data are limited, the ease with which simulated data can now be generated has led to the recent development of approaches that take in image representations of haplotype alignments and automatically extract important features using convolutional neural networks. Digital image processing methods termed α-molecules are a class of techniques for multiscale representation of objects that can extract a diverse set of features from images. One such α-molecule method, termed wavelet decomposition, lends greater control over high-frequency components of images. Another α-molecule method, termed curvelet decomposition, is an extension of the wavelet concept that considers events occurring along curves within images. We show that application of these α-molecule techniques to extract features from image representations of haplotype alignments yields high true positive rates and accuracy for detecting hard and soft selective sweep signatures from genomic data with both linear and nonlinear machine learning classifiers. Moreover, we find that such models are easy to visualize and interpret, with performance rivaling that of contemporary deep learning approaches for detecting sweeps.
  5. Abstract Motivation: The scale and scope of comparative trait data are expanding at unprecedented rates, and recent advances in evolutionary modeling and simulation sometimes struggle to match this pace. Well-organized and flexible applications for conducting large-scale simulations of evolution hold promise in this context for understanding models and, more so, our ability to confidently estimate them with real trait data sampled from nature. Results: We introduce TraitTrainR, an R package designed to facilitate efficient, large-scale simulations under complex models of continuous trait evolution. TraitTrainR employs several output formats, supports popular trait data transformations, accommodates multi-trait evolution, and exhibits flexibility in defining input parameter space and model stacking. Moreover, TraitTrainR permits measurement error, allowing for investigation of its potential impacts on evolutionary inference. We envision a wealth of applications of TraitTrainR, and we demonstrate one such example by examining the problem of evolutionary model selection in three empirical phylogenetic case studies. Collectively, these demonstrations of applying TraitTrainR to explore problems in model selection underscore its utility and broader promise for addressing key questions, including those related to experimental design and statistical power, in comparative biology. Availability and implementation: TraitTrainR is developed in R 4.4.0 and is freely available at https://github.com/radamsRHA/TraitTrainR/, which includes detailed documentation, quick-start guides, and a step-by-step tutorial.
  6. Abstract Motivation: Gene deletion is traditionally thought of as a nonadaptive process that removes functional redundancy from genomes, such that it generally receives less attention than duplication in evolutionary turnover studies. Yet, mounting evidence suggests that deletion may promote adaptation via the “less-is-more” evolutionary hypothesis, as it often targets genes harboring unique sequences, expression profiles, and molecular functions. Hence, predicting the relative prevalence of redundant and unique functions among genes targeted by deletion, as well as the parameters underlying their evolution, can shed light on the role of gene deletion in adaptation. Results: Here, we present CLOUDe, a suite of machine learning methods for predicting evolutionary targets of gene deletion events from expression data. Specifically, CLOUDe models expression evolution as an Ornstein–Uhlenbeck process, and uses multi-layer neural network, extreme gradient boosting, random forest, and support vector machine architectures to predict whether deleted genes are “redundant” or “unique”, as well as several parameters underlying their evolution. We show that CLOUDe boasts high power and accuracy in differentiating between classes, and high accuracy and precision in estimating evolutionary parameters, with optimal performance achieved by its neural network architecture. Application of CLOUDe to empirical data from Drosophila suggests that deletion primarily targets genes with unique functions, with further analysis showing these functions to be enriched for protein deubiquitination. Thus, CLOUDe represents a key advance in learning about the role of gene deletion in functional evolution and adaptation. Availability and implementation: CLOUDe is freely available on GitHub (https://github.com/anddssan/CLOUDe).
  7. Abstract Modern comparative biology owes much to phylogenetic regression. At its conception, this technique sparked a revolution that armed biologists with phylogenetic comparative methods (PCMs) for disentangling evolutionary correlations from those arising from hierarchical phylogenetic relationships. Over the past few decades, the phylogenetic regression framework has become a paradigm of modern comparative biology that has been widely embraced as a remedy for shared ancestry. However, recent evidence has cast doubt on the efficacy of phylogenetic regression, and PCMs more generally, with the suggestion that many of these methods fail to provide an adequate defense against unreplicated evolution, the primary justification for using them in the first place. Importantly, some of the most compelling examples of biological innovation in nature result from abrupt lineage-specific evolutionary shifts, which current regression models are largely ill-equipped to deal with. Here we explore a solution to this problem by applying robust linear regression to comparative trait data. We formally introduce robust phylogenetic regression to the PCM toolkit with linear estimators that are less sensitive to model violations than the standard least-squares estimator, while still retaining high power to detect true trait associations. Our analyses also highlight an ingenuity of the original algorithm for phylogenetic regression based on independent contrasts, whereby robust estimators are particularly effective. Collectively, we find that robust estimators hold promise for improving tests of trait associations and offer a path forward in scenarios where classical approaches may fail. Our study joins recent arguments for increased vigilance against unreplicated evolution and a better understanding of evolutionary model performance in challenging yet biologically important settings.
  8. Satta, Yoko (Ed.)
    Abstract Likelihood-based tests of phylogenetic trees are a foundation of modern systematics. Over the past decade, an enormous wealth and diversity of model-based approaches have been developed for phylogenetic inference of both gene trees and species trees. However, while many techniques exist for conducting formal likelihood-based tests of gene trees, such frameworks are comparatively underdeveloped and underutilized for testing species tree hypotheses. To date, widely used tests of tree topology are designed to assess the fit of classical models of molecular sequence data and individual gene trees and thus are not readily applicable to the problem of species tree inference. To address this issue, we derive several analogous likelihood-based approaches for testing topologies using modern species tree models and heuristic algorithms that use gene tree topologies as input for maximum likelihood estimation under the multispecies coalescent. For the purpose of comparing support for species trees, these tests leverage the statistical procedures of their original gene tree-based counterparts that have an extended history for testing phylogenetic hypotheses at a single locus. We discuss and demonstrate a number of applications, limitations, and important considerations of these tests using simulated and empirical phylogenomic data sets that include both bifurcating topologies and reticulate network models of species relationships. Finally, we introduce the open-source R package SpeciesTopoTestR (SpeciesTopology Tests in R) that includes a suite of functions for conducting formal likelihood-based tests of species topologies given a set of input gene tree topologies. 
  9. Abstract Inferences of adaptive events are important for learning about traits, such as human digestion of lactose after infancy and the rapid spread of viral variants. Early efforts toward identifying footprints of natural selection from genomic data involved development of summary statistic and likelihood methods. However, such techniques are grounded in simple patterns or theoretical models that limit the complexity of settings they can explore. Due to the renaissance in artificial intelligence, machine learning methods have taken center stage in recent efforts to detect natural selection, with strategies such as convolutional neural networks applied to images of haplotypes. Yet, limitations of such techniques include estimation of large numbers of model parameters under nonconvex settings and feature identification without regard to location within an image. An alternative approach is to use tensor decomposition to extract features from multidimensional data while preserving the latent structure of the data, and to feed these features to machine learning models. Here, we adopt this framework and present a novel approach termed T-REx, which extracts features from images of haplotypes across sampled individuals using tensor decomposition, and then makes predictions from these features using classical machine learning methods. As a proof of concept, we explore the performance of T-REx on simulated neutral and selective sweep scenarios and find that it has high power and accuracy to discriminate sweeps from neutrality, robustness to common technical hurdles, and easy visualization of feature importance. Therefore, T-REx is a powerful addition to the toolkit for detecting adaptive processes from genomic data.
  10. Mutually beneficial partnerships between genomics researchers and North American Indigenous Nations are rare yet becoming more common. Here, we present one such partnership that provides insight into the peopling of the Americas and furnishes another line of evidence that can be used to further treaty and Indigenous rights. We show that the genomics of sampled individuals from the Blackfoot Confederacy belong to a previously undescribed ancient lineage that diverged from other genomic lineages in the Americas in Late Pleistocene times. Using multiple complementary forms of knowledge, we provide a scenario for Blackfoot population history that fits with oral tradition and provides a plausible model for the evolutionary process of the peopling of the Americas. 
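Item 1 describes fitting curves to contiguous windows of a summary statistic and taking first and second derivatives as "velocity" and "acceleration" features. The following is a minimal sketch of that idea only, not the ANDES implementation: it assumes equally spaced windows, uses a least-squares quadratic as the fitted curve, and the function name `local_quadratic_derivatives` and the `half_width` parameter are hypothetical.

```python
def local_quadratic_derivatives(values, half_width=2):
    """Fit a least-squares quadratic to each centered run of windows and
    return (first, second) derivative estimates at the run centers.

    Assumption: `values` holds one summary statistic per equally spaced
    genomic window; positions near the ends are skipped.
    """
    offsets = range(-half_width, half_width + 1)
    n = 2 * half_width + 1
    s2 = sum(j ** 2 for j in offsets)  # sum of squared offsets
    s4 = sum(j ** 4 for j in offsets)  # sum of fourth powers of offsets
    first, second = [], []
    for center in range(half_width, len(values) - half_width):
        window = [values[center + j] for j in offsets]
        sum_y = sum(window)
        sum_jy = sum(j * y for j, y in zip(offsets, window))
        sum_j2y = sum(j * j * y for j, y in zip(offsets, window))
        # Closed-form normal equations for y = c0 + c1*j + c2*j^2 on
        # symmetric offsets (odd moments of j vanish).
        slope = sum_jy / s2
        curvature = (n * sum_j2y - s2 * sum_y) / (n * s4 - s2 ** 2)
        first.append(slope)             # "velocity" at the center
        second.append(2.0 * curvature)  # "acceleration" at the center
    return first, second


# Hypothetical per-window statistic, for illustration only.
velocity, acceleration = local_quadratic_derivatives([x * x for x in range(11)])
```

The resulting per-window feature vectors (statistic, velocity, acceleration) could then be passed to any unsupervised anomaly detector; which detectors ANDES uses is not stated in the abstract.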
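Item 6 states that CLOUDe models expression evolution as an Ornstein–Uhlenbeck process. For reference, the conventional form of that process is written below; the symbols $\alpha$, $\theta$, and $\sigma$ follow common usage in comparative methods and are an assumption here, since the abstract does not give the parameterization used by CLOUDe.

```latex
% Ornstein–Uhlenbeck process for a trait value X_t (e.g., expression level):
%   theta : optimum the trait is pulled toward
%   alpha : strength of the pull (alpha > 0)
%   sigma : intensity of random fluctuations; W_t is standard Brownian motion
\[
  dX_t = \alpha\,(\theta - X_t)\,dt + \sigma\,dW_t
\]
% Stationary distribution (alpha > 0):
\[
  X_\infty \sim \mathcal{N}\!\left(\theta,\ \frac{\sigma^{2}}{2\alpha}\right)
\]
```

Setting $\alpha = 0$ recovers Brownian motion, which is why the two models are often compared in evolutionary model selection.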
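Item 7 contrasts robust linear estimators with standard least squares for phylogenetic regression on independent contrasts. The sketch below illustrates the general idea with a Huber M-estimator fitted by iteratively reweighted least squares; it is one example of a robust estimator, not necessarily one used in the study. Contrasts are regressed through the origin, as is standard for independent contrasts. The function names, the tuning constant `c=1.345`, and the example data are assumptions for illustration.

```python
from statistics import median


def ols_through_origin(x, y):
    """Least-squares slope for a regression forced through the origin."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def huber_through_origin(x, y, c=1.345, max_iter=100, tol=1e-9):
    """Huber M-estimate of the slope via iteratively reweighted least squares.

    Assumption: x and y are already-computed standardized contrasts.
    """
    slope = ols_through_origin(x, y)
    for _ in range(max_iter):
        residuals = [b - slope * a for a, b in zip(x, y)]
        # Robust residual scale from the median absolute residual.
        scale = median(abs(r) for r in residuals) / 0.6745
        if scale == 0.0:
            break  # at least half the points fit exactly; nothing to reweight
        cutoff = c * scale
        # Full weight for small residuals, shrinking weight for large ones.
        weights = [1.0 if abs(r) <= cutoff else cutoff / abs(r) for r in residuals]
        new_slope = (sum(w * a * b for w, a, b in zip(weights, x, y))
                     / sum(w * a * a for w, a in zip(weights, x)))
        converged = abs(new_slope - slope) < tol
        slope = new_slope
        if converged:
            break
    return slope


# Hypothetical contrasts: true slope 2, with one abrupt shift in the last contrast.
x_contrasts = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y_contrasts = [2.0 * v for v in x_contrasts]
y_contrasts[-1] += 40.0
ols_slope = ols_through_origin(x_contrasts, y_contrasts)
huber_slope = huber_through_origin(x_contrasts, y_contrasts)
```

In this toy data the single shifted contrast pulls the least-squares slope well away from 2, while the Huber estimate progressively downweights it and stays close to 2.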